NSF PAR Search | NSF Public Access Repository

Youmu: Efficient Columnar Data Pipeline for LLM Training

Zhong, Tianle; Zhao, Jiechen; Su, Qiang; Fox, Geoffrey (February 2025, https://openreview.net/forum?id=I2LF8QHaua)

Large language model (LLM) training is extremely data-intensive, often involving trillions of tokens. Although LLM datasets are usually ingested and stored in columnar formats, they often need to be converted into another format for training, which incurs significant storage and maintenance costs due to extra data copies. While eliminating the conversion would save tens of terabytes of space in costly high-performance storage, this work identifies challenges that drive us to rethink the entire data pipeline. Without conversion, we find that fine-grained random access patterns reduce efficiency by hundreds of times. Specifically, existing data pipelines have two fundamental drawbacks: (1) they cannot efficiently ingest data directly in columnar format because of default coarse-grained I/O; (2) solutions to the first drawback sacrifice memory footprint by caching datasets. In this paper, we present Youmu, a new data pipeline that directly feeds fine-grained columnar data into GPUs, enabling cost-efficient LLM training. Meanwhile, Youmu maintains high training accuracy: its perplexity is 0.3-0.7 lower than that of the widely adopted local shuffle for pretraining. Compared to performance-optimal, state-of-the-art distributed memory-based pipelines, Youmu achieves comparable throughput with 80% less memory footprint.
Free, publicly-accessible full text available February 11, 2026
Supercharging distributed computing environments for high-performance data engineering

https://doi.org/10.3389/fhpcp.2024.1384619

Perera, Niranda; Sarker, Arup Kumar; Shan, Kaiying; Fetea, Alex; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Widanage, Chathura; Staylor, Mills; Zhong, Tianle; Abeykoon, Vibhatha; et al (July 2024, Frontiers in High Performance Computing)

The data engineering and data science community has embraced the idea of using Python and R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these frameworks are now ever more important for processing terabytes of data. They can easily exceed the capabilities of a single machine but also demand significant developer time and effort due to their convenience and ability to manipulate data with high-level abstractions that can be optimized. Therefore, it is essential to design scalable dataframe solutions. There have been multiple efforts to tackle this problem in the most efficient fashion, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask and Ray's distributed computing features look very promising, we perceive that Dask Dataframes and Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (supercharging them!). To achieve this, we integrate a high-performance dataframe system, Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30× higher distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance by leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate high-performance computing and distributed computing ecosystems.
Full Text Available
Hybrid Cloud and HPC Approach to High-Performance Dataframes

https://doi.org/10.1109/BigData55660.2022.10020958

Shan, Kaiying; Perera, Niranda; Lenadora, Damitha; Zhong, Tianle; Kumar Sarker, Arup; Kamburugamuve, Supun; Amila Kanewala, Thejaka; Widanage, Chathura; Fox, Geoffrey (December 2022, 2022 IEEE International Conference on Big Data (Big Data))
